Language-independent text categorization by word N-gram using an automatic acquisition of words

Authors

  • Makoto Suzuki
  • Naohide Yamagishi
  • Yi-Ching Tsai
  • Masayuki Goto
Abstract

We previously proposed the accumulation method, a language-independent text classification method that is based on character N-grams. The accumulation method does not depend on the language structure because this method uses character N-grams to form


Similar resources

Persian text recognition using an n-gram language model and grammatical refinement

Text recognition has been one of the growing research topics in recent years. Much of this research has focused on recognizing letters and sub-words as a basis for identifying larger text structures such as words, phrases, and sentences. This thesis presents a new method in which the recognized sub-words are combined to form meaningful words and sentences in Farsi tex...


Text Categorization Using n-Gram Based Language Independent Technique

This paper presents a language- and topic-independent, byte-level n-gram technique for topic-based text categorization. The technique relies on an n-gram frequency statistics method for document representation and a variant of the k-nearest-neighbors machine learning algorithm for the categorization process. It does not require any morphological analysis of texts, any preprocessing steps, or any prior i...


The Use of Topic Representative Words in Text Categorization

We present a novel way to identify the representative words that are able to capture the topic of documents for use in text categorization. Our intuition is that not all word n-grams equally represent the topic of a document, and thus using all of them can potentially dilute the feature space. Hence, our aim is to investigate methods for identifying good indexing words, and empirically evaluate...


Comparing Neural Network Approach With N-Gram Approach For Text Categorization

This paper compares the neural network approach with the N-gram approach for text categorization and demonstrates that the neural network approach is similar to the N-gram approach but requires much less judging time. Both methods demonstrated here are aimed at language identification. The presence of particular characters and words, together with statistical information on word lengths, is used as a feature vector. I...


A Study Using n-gram Features for Text Categorization

In this paper, we study the effect of using n-grams (sequences of words of length n) for text categorization. We use an efficient algorithm for generating such n-gram features in two benchmark domains, the 20 newsgroups data set and 21,578 REUTERS newswire articles. Our results with the rule learning algorithm RIPPER indicate that, after the removal of stop words, word sequences of length 2 or...

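The abstracts above distinguish character (or byte-level) n-grams from word n-grams. As a rough illustration only, the two can be sketched in Python as follows. The function names are hypothetical, and the whitespace tokenization in `word_ngrams` is a simplifying assumption that is itself language-dependent; this is not the method of any of the papers listed.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams as tuples.

    Assumption: words are separated by whitespace, which does not hold
    for every language.
    """
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_profile(ngrams):
    """Count how often each n-gram occurs (a simple frequency profile)."""
    return Counter(ngrams)


if __name__ == "__main__":
    sample = "text categorization by word n-gram"
    print(char_ngrams("text", 2))
    print(word_ngrams(sample, 2))
    print(ngram_profile(char_ngrams("banana", 2)).most_common(2))
```

Character n-grams need no tokenizer, which is why they are attractive for language-independent methods; word n-grams require some way of finding word boundaries first.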


Publication date: 2013